[feature] support python scalar udf: video_snapshot for video #336
Merged
JingsongLi
requested changes
May 20, 2026
We should take a look at the datafusion-python register_udf API.
Contributor
Find a better name for first_frame.
JingsongLi
reviewed
May 21, 2026
Contributor
Extract UDF-related types into udf.rs and blob reader/stream types into blob.rs. Keep context.rs focused on catalog and SQLContext.
Purpose
This pull request adds Python scalar UDF support for the PyPaimon Rust DataFusion SQL context and uses that bridge to expose a built-in `video_snapshot(video[, timestamp_ms])` SQL function.

The generic API follows the DataFusion Python shape.

`SQLContext()` auto-registers the `video_snapshot` built-in once at session creation time, so callers can use it directly from SQL.

The optional second argument follows the OSS `video/snapshot,t_<ms>` shape: it is a millisecond timestamp and defaults to `0`, so `video_snapshot(video)` captures the first decodable video frame, while `video_snapshot(video, 5000)` captures around 5 seconds. If the requested timestamp is beyond the video duration, the function returns the last decodable frame.

Built-in registration is best-effort during `SQLContext()` construction: if the package's own built-in module cannot be imported or registered, construction still succeeds and emits a `RuntimeWarning`. PyAV/Pillow are still imported lazily when `video_snapshot(...)` executes, not during construction.

`video_snapshot` accepts a binary video value. If the value is a serialized Paimon `BlobDescriptor`, it opens the referenced `uri` with a seekable/range-readable stream and restricts reads to the descriptor `offset`/`length`. Otherwise, it treats the binary value as inline video bytes. The current built-in emits PNG bytes for videos and returns NULL for still images or other values it cannot process as video.

For real Paimon tables, the DataFusion table provider registers the table's `FileIO` and table location prefix in a session-scoped blob reader registry. The built-in first tries that registry, using the canonical Rust `BlobDescriptor` parser and the same table `FileIO` that the scan uses. This lets BlobDescriptor reads reuse catalog/table storage configuration, including object-store options supported by the table `FileIO` such as OSS/S3 credentials. If no registered table location prefix matches a BlobDescriptor URI, the row returns NULL; Python does not directly open descriptor URIs.

Decode/IO failures are logged as warnings and return NULL for that row. Missing PyAV/Pillow still raises so dependency problems are visible. The v1 implementation decodes rows serially in Python, so this built-in is intended for cover/snapshot extraction after filtering to the desired rows rather than large unbounded scans.
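The prefix-based registry lookup and the offset/length restriction described above can be illustrated with a small pure-Python sketch. It is not the PR's implementation (the real registry lives on the Rust side and uses the table `FileIO`); `Descriptor`, `BlobReaderRegistry`, the trailing-slash normalization, and the "most specific prefix wins" rule are all hypothetical names and assumptions chosen for illustration.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO, Callable, Dict, Optional


@dataclass(frozen=True)
class Descriptor:
    """Hypothetical stand-in for a parsed BlobDescriptor."""
    uri: str
    offset: int
    length: int


class BlobReaderRegistry:
    """Hypothetical session-scoped map: table location prefix -> stream opener."""

    def __init__(self) -> None:
        self._openers: Dict[str, Callable[[str], BinaryIO]] = {}

    def register(self, location_prefix: str, opener: Callable[[str], BinaryIO]) -> None:
        # Assumption: normalize to a trailing slash so that "oss://b/t"
        # does not accidentally match "oss://b/t2/...".
        self._openers[location_prefix.rstrip("/") + "/"] = opener

    def read(self, descriptor: Descriptor) -> Optional[bytes]:
        matches = [p for p in self._openers if descriptor.uri.startswith(p)]
        if not matches:
            # No registered table location covers this URI: the row becomes NULL.
            return None
        # Assumption: the most specific (longest) prefix wins.
        opener = self._openers[max(matches, key=len)]
        with opener(descriptor.uri) as stream:
            # Restrict the read to the descriptor's byte range.
            stream.seek(descriptor.offset)
            return stream.read(descriptor.length)
```

In this sketch an in-memory `io.BytesIO` can stand in for a range-readable object-store stream, for example `registry.register("oss://bucket/warehouse/db/t", lambda uri: io.BytesIO(data))`. A URI outside every registered prefix yields `None`, mirroring the "no direct URI fallback" behavior.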
video_snapshot Usage
`video_snapshot` is a SQL built-in registered automatically by `SQLContext()`.

Behavior:
- `video_snapshot(video BINARY) -> BINARY` and `video_snapshot(video BINARY, timestamp_ms INT/BIGINT) -> BINARY`.
- `timestamp_ms` defaults to `0`, matching the OSS `video/snapshot,t_<ms>` shape.
- If `timestamp_ms` is beyond the video duration, the function returns the last decodable frame.
- If `video` is a Paimon `BlobDescriptor`, the function first reads it through the table `FileIO` registered by the DataFusion table provider when the descriptor URI is under a registered table location prefix, so catalog/table storage credentials such as OSS/S3 options are reused.
- If the `BlobDescriptor` URI does not resolve through a registered table location prefix, the function returns `NULL`; there is no direct Python HTTP/file URI fallback.
- If `video` is not a `BlobDescriptor`, the function treats the binary value as inline video bytes.
- `NULL` video, `NULL` timestamp, negative timestamp, still images, non-video bytes, unreadable blobs, or decode failures return `NULL`.
- Filter to the desired rows before calling `video_snapshot` for large tables.
- Additional options such as `m_fast` are left for future extension.

Tests
Note: the full `test_datafusion.py` suite still requires the external `/tmp/paimon-warehouse` fixture for `test_query_simple_table_via_catalog_provider`; without that fixture it fails with `table 'paimon.default.simple_log_table' not found`.

API and Format
Adds Python scalar UDF APIs to `pypaimon_rust.datafusion`:
- `PythonScalarUDF`
- `udf(...)`
- `SQLContext.register_udf(...)`

Adds BlobDescriptor stream helpers in `pypaimon_rust.functions` used by the built-in. `video_snapshot` itself is registered automatically by `SQLContext()`.

The callable receives PyArrow arrays and must return a PyArrow array with the declared return type and matching row count. `input_fields` and `return_field` accept PyArrow `DataType` or `Field` values; string type names remain accepted for compatibility. If no UDF name is provided, the default name is derived from the callable and sanitized to a SQL-friendly identifier.

No storage format changes.
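As an illustration of deriving a default UDF name from a callable and sanitizing it, here is a minimal sketch. The exact rule used by this PR is not stated in the description, so `default_udf_name`, the lower-casing, and the `udf_` prefix for names starting with a digit are assumptions made only for this example.

```python
import functools
import re
from typing import Callable


def default_udf_name(func: Callable) -> str:
    """Derive a SQL-friendly default name from a callable (hypothetical rule)."""
    # Unwrap functools.partial so the underlying function's name is used.
    while isinstance(func, functools.partial):
        func = func.func
    # Plain functions and lambdas expose __name__; callable instances fall
    # back to their class name.
    raw = getattr(func, "__name__", None) or type(func).__name__
    # Replace every character outside [0-9A-Za-z_] with "_" and lower-case.
    name = re.sub(r"[^0-9A-Za-z_]", "_", raw).lower()
    # Avoid empty names and names that start with a digit.
    if not name or name[0].isdigit():
        name = "udf_" + name
    return name
```

Under these assumptions a function named `first_frame` keeps its name, while a lambda (whose `__name__` is `<lambda>`) becomes `_lambda_`. Passing an explicit name at registration avoids relying on any such derived default.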
Dependencies
The built-in imports PyAV/Pillow lazily when `video_snapshot(...)` executes. The package exposes a `video` extra and the Python dev test environment includes `av` and `pillow`.
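The lazy-import behavior can be sketched with the standard library alone. `require_optional` is a hypothetical helper, not a function from this PR; it shows one way to defer an optional import to call time while still raising a visible error when the dependency is missing.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency at call time, not at package import time."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with a hint so the missing dependency stays visible
        # instead of being turned into a NULL result.
        raise ImportError(
            f"{module_name!r} is required for this function; "
            f"install the {extra!r} extra to enable it"
        ) from exc
```

A function body following this pattern would call something like `require_optional("av", "video")` as its first step, so constructing the SQL context never touches PyAV or Pillow, and only executing the function does.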